bump universal image to support training hub 0.3.0 #502
Conversation
Walkthrough

Updated training-image Dockerfiles and Tekton pipelines: the runtime Dockerfile revises package versions and adds a deterministic two-step wheel build and universal entrypoint wiring; a new debug Dockerfile adds CUDA/toolchain, RDMA, diagnostics, and wheel-build steps; Tekton PipelineRuns gain 9h timeouts; the README gets a minor note.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Builder as Docker build
    participant Preflight as Preflight checks
    participant Fetch as Binary wheel fetch
    participant Build as Two-step wheel build
    participant Pip as pip install
    participant Image as Final image
    rect rgb(240, 248, 255)
        Builder->>Preflight: run diagnostics (nvcc, indexes, TORCH_CUDA_ARCH_LIST)
    end
    rect rgb(245, 255, 230)
        Preflight->>Fetch: attempt binary wheel install
        Fetch-->>Preflight: success / not available
    end
    alt binary available
        Fetch->>Pip: install binary wheels
    else build required
        Preflight->>Build: deterministic two-step build (mamba-ssm, permissions, heartbeat, timeout)
        Build->>Pip: install locally-built wheels
    end
    Pip->>Image: finalize (set ENV, COPY entrypoint, set ENTRYPOINT/CMD)
    Image-->>Builder: image produced
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Red Hat Konflux / universal-image-py312-cuda128-torch280-on-pull-request
🔇 Additional comments (4)
images/universal/training/py312-cuda128-torch280/Dockerfile (4)
163-168: No issues found; entrypoint script is properly implemented. The `entrypoint-universal.sh` script exists and correctly implements the required conditional routing logic. It properly routes between workbench mode (when NOTEBOOK_ARGS is set, falling back to `start-notebook.sh`) and headless runtime mode (execs the provided command), with appropriate error handling when neither is available.
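For readers unfamiliar with this dual-mode pattern, the routing logic described above can be sketched as a POSIX shell function. This is a hypothetical simplification for illustration, not the actual `entrypoint-universal.sh`; only the names `NOTEBOOK_ARGS` and `start-notebook.sh` come from the review.

```shell
# Hypothetical simplification of the dual-mode routing (not the real script):
# - workbench mode when NOTEBOOK_ARGS is set or no command was given
# - headless mode when a command is provided (the real script would exec it)
route_mode() {
  if [ -n "${NOTEBOOK_ARGS:-}" ] || [ "$#" -eq 0 ]; then
    echo "workbench"   # real script: exec start-notebook.sh
  else
    echo "headless"    # real script: exec "$@"
  fi
}

route_mode                    # no command -> workbench
route_mode python train.py   # command given -> headless
```

Because the wrapper only branches on its arguments and one environment variable, the same image can serve as an interactive workbench by default and as a headless training runtime when a command is supplied.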
122-122: transformers 4.57.1 is compatible with PyTorch 2.8.0. The library requires torch >= 2.2, so it works with PyTorch 2.8.0. No compatibility issues to address.
139-145: Verify training-hub 0.3.0 is accessible and its dependencies are documented. The training_hub project is available from the Red Hat AI Innovation Team, but web search shows training-hub 0.2.0 as the latest version published on PyPI; training-hub 0.3.0 does not appear to be publicly released. The packages being added (einops 0.8.1, kernels 0.10.3, instructlab-training 0.12.1, rhai-innovation-mini-trainer 0.3.0) should be verified as explicit dependencies of training-hub 0.3.0 rather than assumptions.
Confirm:
- Whether training-hub 0.3.0 is pulled from a development branch or internal repository
- That these packages (especially instructlab-training 0.12.1, which is significantly older than current versions) are intentional and compatible with the training-hub 0.3.0 target
157-160: Remove `--no-deps` or switch to the documented installation method for mamba-ssm. The 2-step installation pattern (causal-conv1d first, then mamba-ssm) aligns with official documentation. However, using `--no-deps` on mamba-ssm deviates from the documented procedure, which instead recommends:
- `pip install mamba-ssm==2.2.6.post3` (without `--no-deps`), or
- `pip install "mamba-ssm[causal-conv1d]==2.2.6.post3"` (using extras syntax).

The `--no-deps` flag is not mentioned in the official installation instructions, and it assumes all transitive dependencies are already installed in the earlier pip block. This assumption is not explicit and could silently omit critical packages. Recommendation: either remove `--no-deps` to let pip resolve dependencies automatically, or test the container build to confirm all runtime dependencies are satisfied.
Force-pushed from a5aa67d to 3416b2c (compare)
```dockerfile
# Install remaining runtime packages (resolved from default PyPI), including FlashAttention
# Note: We intentionally do not use a Pipfile/lock here to avoid mixing resolvers with the base (uv lock),
# to control CUDA/FA install order and indexes, and to reduce lock churn across arches/ABI-specific wheels.
RUN pip install --retries 5 --timeout 300 --no-cache-dir \
```
Possibly in a separate PR, should we start adding kubeflow-sdk and kfp?
kubeflow-sdk is here - again, this is temp, needs to be updated.
I'll add kfp, the imminent 2.15 release has some functionality in it that we will want, so might need to wait for that.
Actionable comments posted: 1
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
157-160: Document rationale for the deterministic two-step install approach. Lines 157–160 implement a specific install sequence for `causal-conv1d` and `mamba-ssm` with the `--no-build-isolation` and `--no-deps` flags. While this pattern may align with runtime behavior, add a comment explaining:
- why `causal-conv1d` must be installed first
- why `mamba-ssm` requires `--no-deps`
- whether this order prevents dependency conflicts or resolves specific build issues

```diff
-# Deterministic 2-step: sub-dep first, then parent without deps (align with runtime)
+# Deterministic 2-step: sub-dep first, then parent without deps (align with runtime)
+# causal-conv1d must be installed before mamba-ssm to ensure it is available during the mamba-ssm build.
+# mamba-ssm uses --no-deps to prevent pip from installing a conflicting version of causal-conv1d.
 RUN pip install --retries 5 --timeout 300 --no-cache-dir --no-build-isolation causal-conv1d==1.5.3.post1 && \
     pip install --retries 5 --timeout 300 --no-cache-dir --no-build-isolation mamba-ssm==2.2.6.post3 --no-deps && \
     fix-permissions /opt/app-root -P
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (3 hunks)
🔇 Additional comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
163-168: Entrypoint script verified and correctly configured. The script exists at the specified path, contains a valid shebang, and will be properly executable in the container via the `--chmod=0755` directive in the COPY command. The container will start correctly.
22-22: No issues found after verification. The required files (LICENSE.md and entrypoint-universal.sh) exist in the repository. Package versions are available on PyPI: training-hub 0.3.0 is published and compatible, bitsandbytes 0.48.1 includes CUDA 12.8 compatible wheels, and all dependency versions are properly specified.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: astefanutti. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Thanks!
New changes are detected. LGTM label has been removed.
Force-pushed from acaff12 to e24db03 (compare)
Actionable comments posted: 1
♻️ Duplicate comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
137-137: Upgrade bitsandbytes to 0.48.2; version 0.48.1 lacks CUDA 12.8 wheels. bitsandbytes 0.48.1 has no CUDA 12.8 wheels available on PyPI (0 cu128 wheels found). This will cause runtime failures when the container attempts to load `libbitsandbytes_cuda128`. Upgrade to bitsandbytes 0.48.2, which has confirmed cu128 support.

```diff
-    bitsandbytes==0.48.1 \
+    bitsandbytes==0.48.2 \
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (3 hunks)
🔇 Additional comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
163-163: No issues found. All required files exist and are correctly configured. Both `LICENSE.md` and `entrypoint-universal.sh` exist in the repository. The `entrypoint-universal.sh` script correctly implements the intended behavior:
- Default mode (no args): runs `start-notebook.sh` via the entrypoint wrapper
- Headless mode (with command): the wrapper executes the provided command
The ENTRYPOINT and CMD configuration at lines 167-168 properly support this dual-mode behavior.
139-140: I need to check what distribution files are actually available for each package, since the absence of cp312 wheels doesn't necessarily mean they won't work. The web search results reveal important version discrepancies and accessibility issues. Let me verify the kernels package version specifically:
All new packages are compatible with Python 3.12; review concern is not valid.
einops 0.8.1 has a universal wheel (einops-0.8.1-py3-none-any.whl), and kernels version 0.10.3 exists on PyPI (released Oct 13, 2025) with Python >=3.9 requirements. Both packages use universal wheels (py3-none-any format) which are compatible with Python 3.12—specific cp312 wheels are unnecessary for pure Python packages. instructlab-training supports Python 3.12. The script's finding of "0 Python 3.12 wheels" reflects the absence of version-specific binary wheels, not actual incompatibility; universal wheels work across all Python 3.x versions.
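The wheel-naming rule this conclusion relies on can be checked mechanically: wheel filenames follow the `name-version-pythontag-abitag-platform.whl` convention (ignoring optional build tags), so the Python tag is always the third field from the end. A small shell sketch; the helper names here are made up for illustration:

```shell
# Extract the Python tag from a wheel filename
# (format: name-version-pythontag-abitag-platform.whl; names use underscores,
# so splitting on "-" is safe for the common case without a build tag)
wheel_python_tag() {
  basename "$1" .whl | awk -F- '{print $(NF-2)}'
}

# A wheel works on CPython 3.12 if it is universal (py3) or built for cp312
is_cp312_compatible() {
  tag=$(wheel_python_tag "$1")
  [ "$tag" = "py3" ] || [ "$tag" = "cp312" ]
}

wheel_python_tag "einops-0.8.1-py3-none-any.whl"   # -> py3
```

Applied to the packages above, `einops-0.8.1-py3-none-any.whl` classifies as universal, which is why the scanner's "0 Python 3.12 wheels" result is a false alarm for pure-Python packages.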
```dockerfile
# Deterministic 2-step: sub-dep first, then parent without deps (align with runtime)
RUN pip install --retries 5 --timeout 300 --no-cache-dir --no-build-isolation causal-conv1d==1.5.3.post1 && \
    pip install --retries 5 --timeout 300 --no-cache-dir --no-build-isolation mamba-ssm==2.2.6.post3 --no-deps && \
    fix-permissions /opt/app-root -P
```
🧩 Analysis chain
Clarify the rationale and verify completeness of the 2-step deterministic installation.
The 2-step installation of causal-conv1d followed by mamba-ssm with --no-deps assumes that causal-conv1d satisfies all of mamba-ssm's dependencies. While this pattern can improve determinism, it bypasses pip's dependency resolution, which carries risk if mamba-ssm has other undeclared dependencies.
Verify that causal-conv1d 1.5.3.post1 is the only missing dependency of mamba-ssm 2.2.6.post3:
Also document in a comment why this explicit 2-step approach is necessary (e.g., "avoids pip resolver conflicts with training-hub" or similar).
🏁 Script executed:

```bash
#!/bin/bash
# Check mamba-ssm dependency metadata
echo "=== Checking mamba-ssm 2.2.6.post3 metadata ==="
curl -s "https://pypi.org/pypi/mamba-ssm/2.2.6.post3/json" | \
python3 -c "
import sys, json
data = json.load(sys.stdin)
info = data.get('info', {})
print('Requires:')
requires = info.get('requires_dist', [])
if requires:
    for req in requires:
        print(f'  - {req}')
else:
    print('  (no metadata available)')
"

# Also verify causal-conv1d version
echo ""
echo "=== Checking causal-conv1d 1.5.3.post1 availability ==="
curl -s "https://pypi.org/pypi/causal-conv1d/1.5.3.post1/json" | \
python3 -c "
import sys, json
try:
    data = json.load(sys.stdin)
    print('✓ causal-conv1d 1.5.3.post1 exists')
except:
    print('✗ causal-conv1d 1.5.3.post1 not found')
"
```

Length of output: 175
Remove the --no-deps flag and declare causal-conv1d explicitly in project requirements instead.
causal-conv1d is listed as an optional component for mamba-ssm, not a required dependency. Using --no-deps bypasses pip's dependency validation, which prevents verification that the installed versions are compatible. If causal-conv1d is not installed, mamba-ssm falls back to an eager implementation that is slower but functional.
Replace the 2-step pattern with a straightforward install:
```dockerfile
RUN pip install --retries 5 --timeout 300 --no-cache-dir --no-build-isolation causal-conv1d==1.5.3.post1 mamba-ssm==2.2.6.post3
```
Or declare causal-conv1d==1.5.3.post1 in your training-hub requirements file so pip resolves both transitively with proper version constraints.
🤖 Prompt for AI Agents
In images/universal/training/py312-cuda128-torch280/Dockerfile around lines 157
to 160, the RUN uses pip install mamba-ssm with --no-deps which bypasses
dependency resolution and prevents pip verifying compatibility with optional
causal-conv1d; remove the --no-deps and either install both packages together in
one pip invocation (e.g., pip install causal-conv1d==1.5.3.post1
mamba-ssm==2.2.6.post3) or add causal-conv1d==1.5.3.post1 to the project
training-hub requirements so pip can resolve versions transitively and ensure
compatibility.
/retest
Force-pushed from da73149 to 9793396 (compare)
Actionable comments posted: 2
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
73-76: Avoid referencing environment variables within the same `ENV` statement (Hadolint DL3044). While Docker expands variables in this context, it is clearer and follows Dockerfile best practices to split the definition into separate statements.
Apply this diff to separate the ENV statements:
```diff
-ENV CUDA_HOME=/usr/local/cuda \
-    PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-    LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-    TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ENV CUDA_HOME=/usr/local/cuda
+
+ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
+    LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH} \
+    TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
```
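As a quick illustration of why DL3044 exists: Docker expands all variables in a single `ENV` instruction against the environment as it was *before* that instruction, so a variable defined and referenced on the same line expands to its old (possibly empty) value. The shell analogue below, using `env` with a function, is a stand-in demonstration of that one-statement expansion order, not Docker itself:

```shell
# Shell analogue of DL3044: both assignments in one env invocation are
# expanded against the *old* environment, so CUDA_HOME is not yet visible.
demo_same_statement() {
  unset CUDA_HOME
  # ${CUDA_HOME:-unset} expands before the new CUDA_HOME value exists
  env CUDA_HOME=/usr/local/cuda DEMO="${CUDA_HOME:-unset}/bin" sh -c 'echo "$DEMO"'
}
demo_same_statement   # -> unset/bin, not /usr/local/cuda/bin
```

Splitting `CUDA_HOME` into its own `ENV` statement, as the diff above does, guarantees the later `PATH` and `LD_LIBRARY_PATH` references see the intended value.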
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml (1 hunks)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml (1 hunks)
- images/universal/training/py312-cuda128-torch280/Dockerfile (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 73-73: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 124-124: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
196-201: LGTM: Entrypoint wrapper and dual-mode behavior. The addition of the entrypoint wrapper and dual-mode setup (workbench by default, headless when args provided) aligns well with the design intent documented in the header. This allows the image to work both as an interactive Jupyter environment and as a runtime container for training jobs.
185-193: Verify whether the `--no-deps` use on line 189 is an intentional design trade-off or an unaddressed issue. The review comment accurately identifies that `--no-deps` is present on line 189 and correctly explains the technical concern: it bypasses pip's dependency validation. However, the Dockerfile comments (lines 183–184) indicate a deliberate "deterministic 2-step build" design, which may intentionally use `--no-deps` to prevent re-resolution during the wheel build phase. Before accepting the recommendation to remove `--no-deps`, confirm:
- whether this issue was actually flagged and deferred in a prior review
- whether the `--no-deps` approach is a known workaround documented in your project's build design
- whether removing `--no-deps` aligns with the stated goal of deterministic builds

If `--no-deps` was an unresolved concern from a prior review and the team decided it should be removed, the review comment's recommendation is sound. If it's an intentional design decision, the concern may need to be reconsidered differently (e.g., adding explicit dependency validation in CI).
```dockerfile
    && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
    && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
    && dnf clean all \
    && rm -rf /var/cache/dnf/*
```
Add ldconfig call after updating /etc/ld.so.conf.d/nvidia.conf.
Writing to ldconfig config files does not take effect until ldconfig is run to rebuild the cache. Without this step, the CUDA libraries at /usr/local/nvidia/lib and /usr/local/nvidia/lib64 will not be discoverable during the preflight check on line 82 or at runtime.
Apply this diff to add the ldconfig cache update:

```diff
 && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
 && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+&& ldconfig \
 && dnf clean all \
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```dockerfile
    && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
    && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
    && ldconfig \
    && dnf clean all \
    && rm -rf /var/cache/dnf/*
```
🤖 Prompt for AI Agents
In images/universal/training/py312-cuda128-torch280/Dockerfile around lines
67-70, after appending /usr/local/nvidia/lib and /usr/local/nvidia/lib64 to
/etc/ld.so.conf.d/nvidia.conf you must run ldconfig to rebuild the dynamic
linker cache so the CUDA libraries become discoverable; add a call to ldconfig
immediately after the echo lines and before the dnf clean all / rm -rf
/var/cache/dnf/* steps.
Force-pushed from 9793396 to ad676de (compare)
Actionable comments posted: 1
♻️ Duplicate comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile (3)
67-70: Missing `ldconfig` call; unresolved critical issue from prior review. Lines 67–68 append CUDA library paths to `/etc/ld.so.conf.d/nvidia.conf`, but the dynamic linker cache is not rebuilt. The `ldconfig -p` check on line 82 comes too late and does not rebuild the cache. Without an explicit `ldconfig` call immediately after writing the config, CUDA libraries at `/usr/local/nvidia/lib` and `/usr/local/nvidia/lib64` remain undiscoverable during preflight checks and at runtime. Apply this diff to rebuild the cache immediately:

```diff
 && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
 && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+&& ldconfig \
 && dnf clean all \
```
142-142: Upgrade bitsandbytes to 0.48.2 or later; unresolved critical issue from prior review. Line 142 pins bitsandbytes to 0.48.1, which lacks official CUDA 12.8 wheels. Versions 0.48.2 and later include published wheels with proper CUDA 12.8 support. Without this upgrade, the image will fail at runtime when bitsandbytes attempts to load `libbitsandbytes_cuda128`. Apply this diff:

```diff
-    bitsandbytes==0.48.1 \
+    bitsandbytes==0.48.2 \
```
183-193: Remove `--no-deps` to enable dependency validation; unresolved critical issue from prior review. Line 189 uses `--no-deps` when building mamba-ssm, which bypasses pip's dependency resolution and prevents verification that causal-conv1d and mamba-ssm versions are compatible. While the 2-step deterministic build with verbose logging is good for debugging, using `--no-deps` masks potential missing or incompatible dependencies. Apply this diff to allow pip to validate the dependency chain:

```diff
-    pip wheel --no-cache-dir --no-build-isolation -vv \
-        --log /tmp/pip-mamba-ssm.log \
-        mamba-ssm==2.2.6.post3 --no-deps -w /tmp/wheels || \
+    pip wheel --no-cache-dir --no-build-isolation -vv \
+        --log /tmp/pip-mamba-ssm.log \
+        causal-conv1d==1.5.3.post1 mamba-ssm==2.2.6.post3 -w /tmp/wheels || \
```

Then update line 191 accordingly:

```diff
-    pip install --no-cache-dir /tmp/wheels/*.whl && \
+    pip install --no-cache-dir /tmp/wheels/mamba*.whl && \
```

Alternatively, add `causal-conv1d==1.5.3.post1` to your training-hub requirements so pip resolves both transitively in a single install step.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml (1 hunks)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml (1 hunks)
- images/universal/training/py312-cuda128-torch280/Dockerfile (4 hunks)
- images/universal/training/py312-cuda128-torch280/README.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 73-73: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 124-124: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (5)
images/universal/training/py312-cuda128-torch280/README.md (1)
7-7: Documentation update is accurate. The addition of "Minimal Workbench" correctly reflects the base image used in the Dockerfile (line 14 of the Dockerfile).
images/universal/training/py312-cuda128-torch280/Dockerfile (4)
119-119: Build tooling upgrades are appropriate. The modern build tooling installation (pip, setuptools, wheel, ninja, cmake) ensures proper extension builds for the downstream packages.
162-181: CUDA user-space libraries via pip are well-structured. The dedicated pip block for NVIDIA CUDA libraries (lines 162–178) and the corresponding `LD_LIBRARY_PATH` update (line 181) provide proper isolation and reproducibility for extension builds that depend on cuDNN and other CUDA runtime libraries.
102-109: Torch installation and preflight checks are well-structured. Separating the PyTorch cu128 install on line 102 from the default PyPI index prevents index resolution conflicts. The inline CUDA validation on lines 105–109 provides useful fail-fast feedback.
195-201: Entrypoint and CMD design supports both workbench and headless modes. The entrypoint wrapper (line 196) with default CMD (line 201) allows the image to preserve workbench behavior by default while enabling headless runtime when a command is provided. This aligns well with the PR objective to support Training Hub 0.3.0 runtime scenarios.
Force-pushed from d71708d to 9b73c61 (compare)
Actionable comments posted: 0
♻️ Duplicate comments (4)
images/universal/training/py312-cuda128-torch280/Dockerfile (4)
67-70: ⚠️ [Unresolved] Missing `ldconfig` call after appending to `/etc/ld.so.conf.d/nvidia.conf`. Writing to ldconfig config files does not take effect until `ldconfig` is run to rebuild the cache. The current code appends to `/etc/ld.so.conf.d/nvidia.conf` but does not call `ldconfig`, so the CUDA libraries at `/usr/local/nvidia/lib` and `/usr/local/nvidia/lib64` will not be discoverable during the runtime checks on line 82 or at runtime. This was flagged in a prior review and remains unresolved. Apply this diff to add the missing `ldconfig` call:

```diff
 && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
 && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+&& ldconfig \
 && dnf clean all \
```
73-76: ⚠️ [Unresolved] ENV statement violates Hadolint DL3044 with self-referencing variables. Lines 74–75 reference `${PATH}` and `${LD_LIBRARY_PATH}` within the same `ENV` statement where they are being defined. This violates Docker best practices (Hadolint DL3044) and may cause unexpected behavior. This issue was flagged in a prior review and remains unresolved. Refactor into separate `ENV` statements to allow prepending to existing values:

```diff
-ENV CUDA_HOME=/usr/local/cuda \
-    PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-    LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-    TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ENV CUDA_HOME=/usr/local/cuda \
+    TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```
150-150: ⚠️ [Unresolved] Upgrade bitsandbytes from 0.48.1 to 0.48.2 or later for CUDA 12.8 wheel support. Line 150 pins `bitsandbytes==0.48.1`, which lacks official CUDA 12.8 wheels. Bitsandbytes 0.48.2 and later include published wheels targeting CUDA 12.8, resolving runtime failures when loading `libbitsandbytes_cuda128`. This critical issue was flagged in a prior review and remains unresolved. Apply this diff to upgrade the package:

```diff
-    bitsandbytes==0.48.1 \
+    bitsandbytes==0.48.2 \
```
191-201: ⚠️ [Unresolved] mamba-ssm build still uses `--no-deps`, bypassing dependency validation. Line 197 continues to use `--no-deps` when building the mamba-ssm wheel, which bypasses pip's dependency resolution and prevents verification that optional dependencies (like causal-conv1d) are compatible with the installed version. This critical issue was flagged in a prior review and remains unresolved. While the 2-step deterministic build pattern (build wheel, then install from local wheel) is a reasonable approach for reproducibility and debugging, the `--no-deps` flag introduces risk. Recommended fix: install both packages together in a single pip invocation to let pip resolve compatibility:

```diff
-RUN python -m pip install --no-cache-dir --no-build-isolation -vv --log /tmp/pip-causal.log causal-conv1d==1.5.3.post1 && \
-    CMAKE_GENERATOR=Ninja CMAKE_ARGS="-DCMAKE_VERBOSE_MAKEFILE=ON" \
-    python -m pip wheel --no-cache-dir --no-build-isolation -vv \
-        --log /tmp/pip-mamba-ssm.log \
-        mamba-ssm==2.2.6.post3 --no-deps -w /tmp/wheels || \
-    (echo '--- pip mamba-ssm log (tail) ---'; tail -n 500 /tmp/pip-mamba-ssm.log; exit 1) && \
-    python -m pip install --no-cache-dir /tmp/wheels/*.whl && \
-    rm -rf /tmp/wheels && \
-    fix-permissions /opt/app-root -P
+RUN CMAKE_GENERATOR=Ninja CMAKE_ARGS="-DCMAKE_VERBOSE_MAKEFILE=ON" \
+    python -m pip install --retries 5 --timeout 300 --no-cache-dir --no-build-isolation \
+        causal-conv1d==1.5.3.post1 mamba-ssm==2.2.6.post3 && \
+    fix-permissions /opt/app-root -P
```

Alternatively, if deterministic wheel builds are essential, remove `--no-deps` to allow pip to validate both packages together before installation.
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
108-110: Consider removing or conditionally gating diagnostic print statements. Lines 108–110 print Python interpreter details and torch metadata during the build. These lines are useful for build-time debugging but are unusual in a production Dockerfile. Consider either removing them, moving them into a separate build stage, or gating them behind a build ARG for optional diagnostic mode.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml (1 hunks)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml (1 hunks)
- images/universal/training/py312-cuda128-torch280/Dockerfile (6 hunks)
- images/universal/training/py312-cuda128-torch280/README.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- images/universal/training/py312-cuda128-torch280/README.md
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 73-73: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 132-132: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (5)
images/universal/training/py312-cuda128-torch280/Dockerfile (5)
95-96: Python user-site configuration is well configured. Setting `PYTHONUSERBASE` and `PYTHONNOUSERSITE` provides clear control over user package installs and discovery paths, which aligns with the containerized environment goals.
106-106: Good adoption of `python -m pip` for improved determinism and consistency. The conversion from `pip` to `python -m pip` throughout the file is a positive change that improves module import consistency and reduces ambiguity about which Python is being used. This is a best practice for container builds. Also applies to: 120-120, 123-124, 127-127.
171-186: NVIDIA CUDA user-space libraries via pip are well selected and versioned. The addition of NVIDIA user-space CUDA libraries (nccl, cublas, cudnn, cufft, curand, cusolver, cusparse, nvjitlink, nvtx, etc.) provides the necessary runtime components for compiled extensions. Version pinning ensures reproducibility.
189-189: cuDNN path addition to LD_LIBRARY_PATH is appropriate. Adding `/opt/app-root/lib/python3.12/site-packages/nvidia/cudnn/lib` to `LD_LIBRARY_PATH` ensures that cuDNN libraries installed via pip are discoverable during source builds. This enables consistent behavior between pip-installed and system-installed CUDA libraries.
204-209: ENTRYPOINT and CMD additions enable flexible invocation. The addition of a POSIX entrypoint wrapper (`entrypoint-universal.sh`) and explicit `ENTRYPOINT`/`CMD` declarations provide clean separation between workbench mode (default, runs start-notebook.sh) and headless mode (when a command is provided). This aligns well with the design intent documented in the file header.
Force-pushed from ff5f03b to da81998 (compare)
Actionable comments posted: 0
♻️ Duplicate comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile (3)
150-150: Resolve unaddressed critical issue: upgrade bitsandbytes to 0.48.2+ for CUDA 12.8 support. Line 150 pins bitsandbytes to 0.48.1, which was flagged as a critical issue in prior reviews and remains unresolved. Version 0.48.1 lacks official CUDA 12.8 wheels; the runtime will fail when trying to load `libbitsandbytes_cuda128`. A prior review confirmed via web search that bitsandbytes 0.48.2 and later include published wheels targeting CUDA 12.8. Apply this fix:

```diff
- bitsandbytes==0.48.1 \
+ bitsandbytes==0.48.2 \
```
67-70: Add `ldconfig` call to rebuild dynamic linker cache after updating `ld.so.conf.d`. Lines 67–68 append NVIDIA library paths to `/etc/ld.so.conf.d/nvidia.conf`, but the dynamic linker cache is not rebuilt. This was flagged as a major issue in prior reviews. Without calling `ldconfig`, the CUDA libraries at `/usr/local/nvidia/lib` and `/usr/local/nvidia/lib64` remain undiscoverable, and the preflight check at line 82 may fail. Apply this diff:

```diff
  && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
  && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+ && ldconfig \
  && dnf clean all \
```
73-76: Split ENV statement to avoid self-referencing variables (Hadolint DL3044). Lines 73–76 define `CUDA_HOME`, `PATH`, `LD_LIBRARY_PATH`, and `TORCH_CUDA_ARCH_LIST` in a single `ENV` block, with `PATH` and `LD_LIBRARY_PATH` self-referencing their own values. This violates Hadolint DL3044 and may behave unexpectedly. This was flagged as a major issue in prior reviews and remains unresolved. Refactor into separate `ENV` statements to allow safe prepending:

```diff
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ ENV CUDA_HOME=/usr/local/cuda
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
+ ENV TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
```
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
189-199: Consider removing or better-documenting the large commented-out code block.Lines 189–199 contain a commented-out 2-step deterministic build for causal-conv1d and mamba-ssm with verbose logging. If these packages are not planned for immediate use, remove this block to keep the Dockerfile clean. If these are deferred features, add a clearer comment explaining the reason for deferral (e.g., "# TODO: Enable after resolving mamba-ssm --no-deps issue" or "# DEBUG: Disabled pending package availability").
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (6 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- images/universal/training/py312-cuda128-torch280/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 73-73: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 132-132: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Red Hat Konflux / universal-image-py312-cuda128-torch280-on-pull-request
🔇 Additional comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
95-96: Good additions: modern pip usage, diagnostic support, explicit NVIDIA wheels, and proper entrypoint setup. The following improvements are well-executed:

- Lines 95–96: Python user-site environment variables properly configured.
- Lines 106, 120, 123–124, 127, 132: switched to `python -m pip` (modern best practice).
- Lines 108–110: added diagnostics for Python environment and torch availability.
- Lines 168–184: explicit installation of NVIDIA CUDA user-space libraries via pip ensures extension builds have necessary headers and libraries.
- Line 187: LD_LIBRARY_PATH extended to include cuDNN from pip, aligning with pip-installed wheels.
- Lines 202–207: ENTRYPOINT and CMD properly set, maintaining design intent.
Also applies to: 106-106, 120-120, 123-124, 127-127, 132-132, 168-184, 187-187, 202-207
202-202: No action required: `entrypoint-universal.sh` exists and is valid. The file at `./images/universal/training/py312-cuda128-torch280/entrypoint-universal.sh` exists in the repository and contains valid shell script content. The COPY command on line 202 will execute successfully.
Actionable comments posted: 0
♻️ Duplicate comments (4)
images/universal/training/py312-cuda128-torch280/Dockerfile (4)
177-177: Critical: Update bitsandbytes to version 0.48.2 or later (duplicate fix required). This is the second occurrence of `bitsandbytes==0.48.1` that must be updated to match the fail-fast check. See the preceding comment for full context. Apply this diff:

```diff
- bitsandbytes==0.48.1 \
+ bitsandbytes==0.48.2 \
```
151-152: Critical: Update bitsandbytes to version 0.48.2 or later for CUDA 12.8 support. bitsandbytes 0.48.1 publishes only generic manylinux wheels without CUDA 12.8-specific wheels on PyPI. Since this Dockerfile targets CUDA 12.8, the fail-fast check here will fail during installation, and even if it passes, runtime will fail when attempting to load `libbitsandbytes_cuda128`. This issue was previously flagged as critical and marked as resolved, but the version remains unchanged. Apply this diff:

```diff
- python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
-     numba==0.62.1 bitsandbytes==0.48.1 || \
+ python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
+     numba==0.62.1 bitsandbytes==0.48.2 || \
```
68-71: Add `ldconfig` call after updating `/etc/ld.so.conf.d/nvidia.conf`. Writing to ldconfig config files does not take effect until `ldconfig` is run to rebuild the cache. Without this step, the CUDA libraries at `/usr/local/nvidia/lib` and `/usr/local/nvidia/lib64` will not be discoverable during the preflight checks on line 88 or at runtime. Apply this diff to add the ldconfig call:

```diff
  && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
  && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+ && ldconfig \
  && dnf clean all \
```
73-76: Fix environment variable self-reference (Hadolint DL3044). Within a single `ENV` statement, you cannot reference a variable being defined (e.g., `${PATH}` and `${LD_LIBRARY_PATH}` on lines 75-76). This violates Docker best practices. Refactor into separate `ENV` statements to allow prepending to existing values. Apply this diff to split the ENV block:

```diff
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH
- # TODO: revisit this, going with default list for now.
- # TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ ENV CUDA_HOME=/usr/local/cuda \
+     TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```

Also uncomment `TORCH_CUDA_ARCH_LIST` (currently commented on line 78) to enable explicit CUDA architecture targeting for builds.
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
228-264: Clarify rationale for the mamba-ssm `--no-deps` pattern. The 2-step installation of `causal-conv1d` followed by `mamba-ssm` with `--no-deps` (line 241) bypasses pip's dependency resolution. While this can improve determinism and control build order, it prevents pip from verifying compatibility with optional components. If `causal-conv1d` is missing, `mamba-ssm` falls back to a slower eager implementation rather than failing fast. Consider documenting in a comment why this explicit 2-step approach is necessary (e.g., "avoids pip resolver conflicts with training-hub" or similar), or alternatively, declare `causal-conv1d==1.5.3.post1` explicitly in the training-hub requirements so pip can resolve both transitively with proper version constraints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (6 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
- images/universal/training/py312-cuda128-torch280/README.md
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 234-234: unexpected 's'
expecting '#', '', ADD, ARG, CMD, COPY, ENTRYPOINT, ENV, EXPOSE, FROM, HEALTHCHECK, LABEL, MAINTAINER, ONBUILD, RUN, SHELL, STOPSIGNAL, USER, VOLUME, WORKDIR, a pragma, at least one space, or end of input
(DL1000)
🔇 Additional comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile (3)
22-22: License copy and pip configuration look good. The LICENSE.md file has been verified to exist in the repository, and the new pip environment variables (PIP_DISABLE_PIP_VERSION_CHECK, PIP_UPGRADE_STRATEGY) are appropriate additions for deterministic builds.

Also applies to: 33-34
232-260: Address Hadolint DL1000 syntax error in mamba-ssm heredoc. Static analysis reports an "unexpected 's'" error at line 234, suggesting potential issues with the complex quoting in the heredoc structure. While the syntax may be valid, the nested quoting pattern `'cat > file <<'\''BASH'\''` is difficult to parse and maintain. Consider refactoring for clarity. One approach is to avoid the complex quote escaping by using a different pattern:

```diff
- bash -lc 'cat > /tmp/build_mamba_wheel.sh <<'\''BASH'\''
+ bash -c 'cat > /tmp/build_mamba_wheel.sh << "BASH"'
```

This eliminates the escaped-quote pattern. Verify that the refactored version produces the same heredoc behavior and that Hadolint validation passes.
205-224: NVIDIA CUDA packages and entrypoint setup look good. The comprehensive set of NVIDIA CUDA user-space packages (lines 205-221) are properly versioned and include cuDNN, NCCL, and build libraries. The LD_LIBRARY_PATH update on line 224 correctly points to pip-installed cuDNN. The entrypoint wrapper pattern (lines 266-272) provides sensible flexibility between workbench (default) and headless runtime modes.

Also applies to: 266-272
/retest
ad29950 to 7553cf9 Compare
/retest
7553cf9 to 8927298 Compare
Actionable comments posted: 0
♻️ Duplicate comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
68-71: Add `ldconfig` call to make /etc/ld.so.conf.d/nvidia.conf paths discoverable. Add a file with the .conf extension to /etc/ld.so.conf.d/ that contains the paths to the libraries, then run ldconfig. Modifying `/etc/ld.so.conf.d/nvidia.conf` alone does not take effect until `ldconfig` is invoked to rebuild the dynamic linker cache. The preflight check at line 88 (`ldconfig -p | grep`) will fail if ldconfig is not run after the echo commands.

```diff
  && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
  && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+ && ldconfig \
  && dnf clean all \
```
74-76: Split ENV statements to resolve self-reference violation (Hadolint DL3044). Docker will not expand a variable within the same ENV statement where it is defined. While it will not crash Docker, it carries a high likelihood of errors. Lines 74-76 reference `${PATH}` and `${LD_LIBRARY_PATH}` in the same `ENV` statement where they are being prepended, violating Hadolint's DL3044 rule. Refactor into separate `ENV` statements:

```diff
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH
- # TODO: revisit this, going with default list for now.
- # TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ ENV CUDA_HOME=/usr/local/cuda
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```

Also note: the TODO comment about `TORCH_CUDA_ARCH_LIST` was removed; ensure this is intentional and the default Torch compile list is acceptable for your GPUs.
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
228-282: Simplify or document the mamba-ssm wheel build logic. Lines 228-282 implement a complex heredoc-based shell script embedded in a single RUN instruction to build the mamba-ssm wheel with timeout and heartbeat monitoring. While the intent (prevent hangs, log progress) is sound, the 50+ lines of inline shell logic make the Dockerfile harder to maintain. Consider:

- Extract to a separate build script in the repository (e.g., `build-mamba-ssm.sh`) and COPY it into the image before running it. This improves readability and allows unit testing of the script.
- Add a brief comment explaining why the 2-step causal-conv1d + mamba-ssm approach is necessary (e.g., "Mamba-SSM build takes >10 minutes; timeout ensures the process completes or fails safely without hanging the build").
- Verify the `--no-deps` flag on line 280 is intentional: are there any transitive dependencies of mamba-ssm beyond causal-conv1d that should be installed?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (6 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
- images/universal/training/py312-cuda128-torch280/README.md
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 74-74: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 157-157: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile (3)
22-22: Verify LICENSE.md file exists in repository. Line 22 copies `./images/universal/training/py312-cuda128-torch280/LICENSE.md` into the image. Ensure this file exists in the repository, otherwise the Docker build will fail with a COPY error. If the CUDA license should not be bundled, remove this instruction.
285-285: Verify entrypoint-universal.sh file exists in repository. Line 285 copies `./images/universal/training/py312-cuda128-torch280/entrypoint-universal.sh` to the image. Ensure this script exists in the repository and is executable. If this file is missing, the Docker build will fail.
151-152: No version upgrade needed; bitsandbytes 0.48.1 already supports CUDA 12.8. bitsandbytes 0.48.1 and 0.48.2 both have prebuilt wheels covering CUDA 12.6 and 12.8, so the suggested upgrade from 0.48.1 to 0.48.2 is not justified. The claim that 0.48.1 "publishes only generic manylinux wheels without CUDA version tagging" is incorrect: the published wheels cover CUDA 12.8 through the manylinux wheels on PyPI and GitHub. The fail-fast check at line 151 will succeed as intended.
All supporting files (LICENSE.md and entrypoint-universal.sh) exist in the repository. The ldconfig call does occur in the build process (line 88), though not immediately after writing to /etc/ld.so.conf.d/nvidia.conf (lines 68–69), which is acceptable across separate RUN layers.
Likely an incorrect or invalid review comment.
8927298 to 02c9a56 Compare
Actionable comments posted: 0
♻️ Duplicate comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile (3)
151-151: Update bitsandbytes to 0.48.2 or later (critical, unresolved from prior reviews). Line 151 still pins `bitsandbytes==0.48.1`, which lacks official CUDA 12.8 wheels on PyPI. This was flagged as critical in prior reviews and will cause runtime failures when attempting to load `libbitsandbytes_cuda128`. Upgrade to 0.48.2 or later for cu128 support:

```diff
- numba==0.62.1 bitsandbytes==0.48.1 || \
+ numba==0.62.1 bitsandbytes==0.48.2 || \
```

Also update line 177:

```diff
- bitsandbytes==0.48.1 \
+ bitsandbytes==0.48.2 \
```
68-71: Missing `ldconfig` call after updating `/etc/ld.so.conf.d/nvidia.conf`. Writing to ldconfig config files requires rebuilding the dynamic linker cache for the paths to become discoverable. The preflight check on line 88 (`ldconfig -p | grep -E`) will fail without this step. Apply this diff to add the ldconfig call:

```diff
  && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
  && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+ && ldconfig \
  && dnf clean all \
```
74-76: Fix Hadolint DL3044: split self-referencing ENV variables into separate statements. Referencing `${PATH}` and `${LD_LIBRARY_PATH}` within the same `ENV` block where they are being defined violates Docker best practices. Split these into separate `ENV` statements:

```diff
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH
- # TODO: revisit this, going with default list for now.
- # TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ ENV CUDA_HOME=/usr/local/cuda
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
228-285: Consider removing `--no-deps` from mamba-ssm and installing both packages together. Line 243 uses `--no-deps` to skip dependency validation, relying on causal-conv1d being pre-installed on line 228. While this works because causal-conv1d is indeed installed first, it bypasses pip's dependency resolver, preventing verification of version compatibility. The complex shell wrapper adds valuable diagnostics and timeout handling, but the underlying pattern could be improved. Consider this alternative:

```diff
- RUN python -m pip install --no-cache-dir --no-build-isolation -vv --log /tmp/pip-causal.log causal-conv1d==1.5.3.post1 && \
+ RUN python -m pip install --no-cache-dir --no-build-isolation causal-conv1d==1.5.3.post1 && \
      [shell script with diagnostics/timeout] && \
-     python -m pip install --no-cache-dir /tmp/wheels/*.whl && \
+     python -m pip install --no-cache-dir /tmp/wheels/*.whl --no-deps && \
```

Or better yet: add `causal-conv1d==1.5.3.post1` to your training-hub requirements so pip resolves both transitively without `--no-deps`.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `images/runtime/.DS_Store` is excluded by `!**/.DS_Store`
- `images/runtime/training/.DS_Store` is excluded by `!**/.DS_Store`
📒 Files selected for processing (4)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (6 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
- images/universal/training/py312-cuda128-torch280/README.md
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 74-74: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 157-157: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (6)
images/universal/training/py312-cuda128-torch280/Dockerfile (6)
33-34: Good: PIP environment variables improve build determinism. The addition of `PIP_UPGRADE_STRATEGY=only-if-needed` (line 34) and other pip environment settings promote reproducibility and control over dependency resolution.
58-60: Good: CUDA toolchain, build environment, and user configuration. The additions of CUDA development packages (cuda-libraries-devel-12-8, cuda-compiler-12-8) on lines 58-60, build environment variables (USE_NINJA, TORCH_CUDA_VERBOSE_BUILD) on lines 81-83, and proper PYTHONUSERBASE configuration on lines 107-108 improve the build environment and extension compatibility.
Also applies to: 81-83, 107-108
96-99: Good: Preflight diagnostics improve debuggability. Adding diagnostic output for pip, compiler versions, and nvcc early in the build helps quickly identify environment issues before expensive package installations.
206-221: Good: NVIDIA CUDA pip packages and cuDNN path for extension builds. Installing NVIDIA CUDA pip packages (nvidia-cudnn-cu12, nvidia-cublas-cu12, etc.) on lines 206-221 and adding the cuDNN lib path on line 224 properly support source builds of extensions like flash-attn and mamba-ssm that expect these libraries to be discoverable.
Also applies to: 224-224
191-193: Good: Training hub 0.3.0 with documented version constraint handling. Line 191 correctly pins `training-hub==0.3.0`, supporting the PR objective. The comments on lines 138-143 and 197-200 pragmatically document the known upstream kubernetes/urllib3 version mismatch and defer `pip check` until the upstream constraint loosens. This is a reasonable approach given the constraint is external.

Also applies to: 138-143, 197-200
22-22: Files exist; no issues found. Both `LICENSE.md` and `entrypoint-universal.sh` are present at their expected paths. The COPY operations on lines 22 and 288 will succeed without errors.
57025bf to d62c537 Compare
Actionable comments posted: 0
♻️ Duplicate comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile (3)
68-71: Missing `ldconfig` call after updating `/etc/ld.so.conf.d/nvidia.conf`. Writing to ldconfig config files does not take effect until `ldconfig` is run to rebuild the cache. Lines 68–69 append NVIDIA library paths to the config, but the cache is never rebuilt. Without this step, the CUDA libraries at `/usr/local/nvidia/lib` and `/usr/local/nvidia/lib64` will not be discoverable at runtime or during the preflight check on line 88. Apply this diff to rebuild the ldconfig cache:

```diff
  && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
  && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+ && ldconfig \
  && dnf clean all \
```
183-183: Upgrade bitsandbytes to 0.48.2 or later for CUDA 12.8 compatibility. Line 183 pins `bitsandbytes==0.48.1`, which was flagged as incompatible with CUDA 12.8 in prior reviews. Ensure consistency: if the fail-fast check on line 157 is updated to a newer version, this line must also be updated. Apply this diff:

```diff
- bitsandbytes==0.48.1 \
+ bitsandbytes==0.48.2 \
```
75-78: Fix Hadolint DL3044: ENV statement cannot self-reference its own variables. Lines 75–78 violate Docker best practices (and trigger Hadolint DL3044) by referencing `${PATH}` and `${LD_LIBRARY_PATH}` within the same `ENV` statement where they are being set. This can cause unexpected behavior depending on the Docker version. Refactor into separate `ENV` statements to allow prepending to existing values:

```diff
-ENV CUDA_HOME=/usr/local/cuda \
-    PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-    LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-    TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ENV CUDA_HOME=/usr/local/cuda \
+    TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```
🧹 Nitpick comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
163-163: Simplify the find command to avoid Shellcheck SC2261 warning. Hadolint and Shellcheck flag line 163 for multiple redirections competing for stdout. This occurs within the fail-fast check when running `find /tmp -type f -name "*.o" | wc -l || true`. While the `|| true` fallback handles errors gracefully, a cleaner pattern would silence the output on failure:

```diff
- numba==0.62.1 bitsandbytes==0.48.1 || \
+ numba==0.62.1 bitsandbytes==0.48.2 || \
```

(Note: Once bitsandbytes is upgraded to 0.48.2 per the critical issue above, this refactor becomes moot.)
232-316: Mamba-SSM wheel build section: extensive logging is welcome, but document the timeout logic. Lines 232–316 implement an impressive deterministic 2-step wheel build with ccache support, detailed heartbeat logging, and a 20-minute timeout to prevent runaway compilation. The timeout and process-group cleanup logic are robust. However, the inline shell script (generated dynamically) is difficult to debug and maintain. Recommendations for a future refactor:

- Extract the mamba-ssm build script into a separate file in the Dockerfile directory (e.g., `build-mamba-ssm.sh`) and COPY it into the image, making it easier to version control and debug.
- Add a comment explaining why the 20-minute timeout is chosen (e.g., "Historical data shows mamba-ssm 2.2.6.post3 compiles in ~8–15 minutes on c5.4xlarge; 20 minutes provides 33% headroom").
- Document why `--no-deps` is used for mamba-ssm and whether causal-conv1d is a hard requirement or optional fallback.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `images/runtime/.DS_Store` is excluded by `!**/.DS_Store`
- `images/runtime/training/.DS_Store` is excluded by `!**/.DS_Store`
📒 Files selected for processing (4)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (5 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- images/universal/training/py312-cuda128-torch280/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (5)
.tekton/universal-image-py312-cuda128-torch280-push.yaml (1)
20-21: Pipeline timeout extension aligns with expanded build scope. The 9-hour timeout complements the 8-hour timeout on the individual `build-container` task and accommodates the Dockerfile's extended diagnostics and multi-phase wheel builds.

images/universal/training/py312-cuda128-torch280/Dockerfile (4)
142-142: Clarify the branch name and confirm the SDK repository exists. Line 142 installs the SDK from a git URL referencing the `training-hub` branch: `git+https://github.com/briangallagher/sdk@training-hub`. Confirm that this branch exists in that repository and is intended for this PR (as opposed to using a released PyPI version of the SDK).
319-319: No issues found: entrypoint-universal.sh exists and COPY will succeed. The file is present at the specified path. Line 319's COPY command will execute without errors during the Docker build.
22-22: LICENSE.md file verified; no action required. The LICENSE.md file exists at the specified path. The COPY instruction will execute successfully during the Docker build.
157-158: No action required: bitsandbytes 0.48.1 includes official CUDA 12.8 wheels. The bitsandbytes v0.48.x series includes CUDA 12.8 wheels with support for Linux x86_64, aarch64, and Windows. The pinned version at line 157 is correct and will not cause the runtime failures previously flagged in prior reviews. The fail-fast mechanism will function as intended.
fde6736 to ba6e504 Compare
Actionable comments posted: 4
♻️ Duplicate comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
68-71: Add `ldconfig` call to rebuild dynamic linker cache after updating `/etc/ld.so.conf.d/`. Lines 68–71 append NVIDIA library paths to `/etc/ld.so.conf.d/nvidia.conf`, but the dynamic linker cache is not rebuilt. Without `ldconfig`, CUDA libraries will not be discoverable at line 88 (ldconfig grep verification) or at runtime. This was flagged in prior reviews. Apply this diff:

```diff
  && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
  && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+ && ldconfig \
  && dnf clean all \
```
157-157: Critical regression: bitsandbytes 0.48.1 lacks CUDA 12.8 wheels; upgrade required. Lines 157 and 183 still pin `bitsandbytes==0.48.1`, which lacks CUDA 12.8 (cu128) wheel support. The fail-fast check at line 157 will fail during image build, and at runtime, the library fails to load `libbitsandbytes_cuda128`. Upgrade to `bitsandbytes==0.48.2` or later (verified cu128 support):

Line 157:

```diff
- numba==0.62.1 bitsandbytes==0.48.1 || \
+ numba==0.62.1 bitsandbytes==0.48.2 || \
```

Line 183:

```diff
- bitsandbytes==0.48.1 \
+ bitsandbytes==0.48.2 \
```
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
232-235: Clarify the 2-step `causal-conv1d` + `mamba-ssm --no-deps` pattern. Lines 232–235 use a 2-step installation with `--no-deps` on `mamba-ssm`, which bypasses pip's dependency verification. While this pattern can improve build determinism, document why this explicit approach is necessary (e.g., "causal-conv1d is optional for mamba-ssm and required for performance; using `--no-deps` avoids resolver conflicts with training-hub"). Alternatively, declare `causal-conv1d==1.5.3.post1` in project requirements so pip resolves both transitive versions with proper constraints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `images/runtime/.DS_Store` is excluded by `!**/.DS_Store`
- `images/runtime/training/.DS_Store` is excluded by `!**/.DS_Store`
📒 Files selected for processing (5)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (5 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile.debug` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
- images/universal/training/py312-cuda128-torch280/README.md
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
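For the SC2261 findings above: the warning means two redirections target stdout, so only the last one takes effect. A minimal shell sketch of the `tee` fix (file paths are illustrative):

```shell
# Anti-pattern (SC2261): `cmd > a.log > b.log` -- only b.log receives output.
# Fix: tee duplicates stdout into one file while `>` captures the other.
printf 'build ok\n' | tee /tmp/sc2261-a.log > /tmp/sc2261-b.log
# Both files now contain the same line.
cat /tmp/sc2261-a.log
cat /tmp/sc2261-b.log
```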
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Red Hat Konflux / universal-image-py312-cuda128-torch280-on-pull-request
🔇 Additional comments (3)
.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml (1)
21-22: Pipeline timeout increase is appropriate for extended build workflow.

The 9-hour timeout aligns with the expanded build steps (CUDA diagnostics, two-phase wheel compilation, extensive package installation) introduced in the updated Dockerfile. No other changes to pipeline structure or task definitions.
images/universal/training/py312-cuda128-torch280/Dockerfile.debug (1)
22-22: No issues found: required source files are present.

Verification confirms both `LICENSE.md` and `entrypoint-universal.sh` exist at the expected paths. The Dockerfile COPY commands at lines 22 and 319 will succeed.

`images/universal/training/py312-cuda128-torch280/Dockerfile` (1)
22-22: Required source files verified to exist in repository.

Both `LICENSE.md` and `entrypoint-universal.sh` are committed and present at their expected locations. The Dockerfile COPY commands at lines 22 and 238 will succeed without issues.
```diff
 ENV CUDA_HOME=/usr/local/cuda \
-    PATH=/usr/local/cuda/bin:$PATH \
-    LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
-    TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+    PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
+    LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
+    TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
```
Resolve Hadolint DL3044: ENV statement self-references variable being defined.

Lines 75–78 violate Docker best practice by referencing `${CUDA_HOME}` within the same `ENV` statement where it is being defined. Additionally, `${LD_LIBRARY_PATH}` on line 77 references itself within the same statement. This was flagged in prior reviews but remains unresolved.

Refactor into separate `ENV` statements:

```diff
- ARG CUDA_ARCH_LIST=9.0
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ARG CUDA_ARCH_LIST=9.0
+ ENV CUDA_HOME=/usr/local/cuda \
+     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```

🧰 Tools
🪛 Hadolint (2.14.0)
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
🤖 Prompt for AI Agents
In images/universal/training/py312-cuda128-torch280/Dockerfile around lines 75
to 78, the single ENV statement self-references variables being defined
(CUDA_HOME and LD_LIBRARY_PATH) which triggers Hadolint DL3044; split into
multiple ENV statements: first set CUDA_HOME alone, then set PATH referencing
the already-defined CUDA_HOME, then set LD_LIBRARY_PATH referencing CUDA_HOME
and the prior LD_LIBRARY_PATH, and finally set TORCH_CUDA_ARCH_LIST in its own
ENV line so no variable is referenced in the same ENV where it is defined.
| && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \ | ||
| && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \ | ||
| && dnf clean all \ | ||
| && rm -rf /var/cache/dnf/* |
Add `ldconfig` call to rebuild dynamic linker cache after updating `/etc/ld.so.conf.d/`.

Lines 68–71 append NVIDIA library paths to `/etc/ld.so.conf.d/nvidia.conf`, but the dynamic linker cache is not rebuilt. Without running `ldconfig`, the CUDA libraries will not be discoverable at line 88 (the `ldconfig` grep check) or at runtime.

Apply this diff:

```diff
   && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
   && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+  && ldconfig \
   && dnf clean all \
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```dockerfile
    && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
    && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
    && ldconfig \
    && dnf clean all \
    && rm -rf /var/cache/dnf/*
```
🤖 Prompt for AI Agents
In images/universal/training/py312-cuda128-torch280/Dockerfile.debug around
lines 68 to 71, after appending NVIDIA library paths to
/etc/ld.so.conf.d/nvidia.conf you must run ldconfig to rebuild the dynamic
linker cache so the CUDA libraries are discoverable later; insert a step that
runs ldconfig (as root) immediately after the echo lines and before the dnf
clean all and rm -rf /var/cache/dnf/* lines.
```dockerfile
ENV CUDA_HOME=/usr/local/cuda \
    PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
    LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
    TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
```
Resolve Hadolint DL3044: ENV statement self-references variable being defined.

Lines 75–78 violate Docker best practice by referencing `${CUDA_HOME}` within the same `ENV` statement where it is being defined. Additionally, `${LD_LIBRARY_PATH}` on line 77 references itself within the same statement. This prevents Docker from properly expanding these values and is flagged by Hadolint DL3044.

Refactor into separate `ENV` statements:

```diff
- ARG CUDA_ARCH_LIST=9.0
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ARG CUDA_ARCH_LIST=9.0
+ ENV CUDA_HOME=/usr/local/cuda \
+     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Hadolint (2.14.0)
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
🤖 Prompt for AI Agents
In images/universal/training/py312-cuda128-torch280/Dockerfile.debug around
lines 75 to 78, the ENV statement self-references variables being set (CUDA_HOME
and LD_LIBRARY_PATH), which prevents proper expansion and triggers Hadolint
DL3044; split the combined ENV into multiple ENV statements so CUDA_HOME is
defined first, then set PATH and LD_LIBRARY_PATH in subsequent ENV lines that
reference the already-defined CUDA_HOME, and set TORCH_CUDA_ARCH_LIST in its own
ENV (or after CUDA_HOME) to ensure variables are expanded correctly.
/retest |
Force-pushed ba6e504 to fa75edd (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (5)
images/universal/training/py312-cuda128-torch280/Dockerfile (3)
68-71: Add `ldconfig` call after updating `/etc/ld.so.conf.d/nvidia.conf`.

Lines 68–69 append NVIDIA library paths to the ldconfig config, but the dynamic linker cache is not rebuilt. Without `ldconfig`, the CUDA libraries will not be discoverable at line 88 (the `ldconfig` grep check) or at runtime. This was flagged in prior reviews but remains unresolved.

Apply this diff:

```diff
   && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
   && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+  && ldconfig \
   && dnf clean all \
```
75-78: Hadolint DL3044 violation: ENV statement self-references variables being defined.

Lines 75–78 violate Docker best practices by referencing `${CUDA_HOME}` and `${LD_LIBRARY_PATH}` within the same `ENV` statement where they are being defined. This was flagged in prior reviews but remains unresolved.

Split into separate `ENV` statements:

```diff
+ ARG CUDA_ARCH_LIST=9.0
+ ENV CUDA_HOME=/usr/local/cuda \
+     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
- ARG CUDA_ARCH_LIST=9.0
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
```
157-157: Update bitsandbytes from 0.48.1 to 0.48.2 at lines 157 and 183.

bitsandbytes 0.48.1 has a confirmed packaging/build problem with CUDA 12.8 wheels, and 0.48.2 contains the fix. The Dockerfile must apply the suggested diffs to both lines:

Line 157:

```diff
-    numba==0.62.1 bitsandbytes==0.48.1 || \
+    numba==0.62.1 bitsandbytes==0.48.2 || \
```

Line 183:

```diff
-    bitsandbytes==0.48.1 \
+    bitsandbytes==0.48.2 \
```

`images/universal/training/py312-cuda128-torch280/Dockerfile.debug` (2)
68-71: Add `ldconfig` call after updating `/etc/ld.so.conf.d/nvidia.conf`.

Lines 68–69 append NVIDIA library paths to the ldconfig config, but the dynamic linker cache is not rebuilt. Without `ldconfig`, the CUDA libraries will not be discoverable at line 88 (the `ldconfig` grep check) or at runtime. This was flagged in prior reviews but remains unresolved.

Apply this diff:

```diff
   && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
   && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+  && ldconfig \
   && dnf clean all \
```
75-78: Hadolint DL3044 violation: ENV statement self-references variables being defined.

Lines 75–78 violate Docker best practices by referencing `${CUDA_HOME}` and `${LD_LIBRARY_PATH}` within the same `ENV` statement where they are being defined. This was flagged in prior reviews but remains unresolved.

Split into separate `ENV` statements:

```diff
+ ARG CUDA_ARCH_LIST=9.0
+ ENV CUDA_HOME=/usr/local/cuda \
+     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
- ARG CUDA_ARCH_LIST=9.0
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
```
🧹 Nitpick comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
232-235: Verify the mamba-ssm install pattern and document the rationale.

The two-step install (`causal-conv1d` then `mamba-ssm --no-deps`) assumes causal-conv1d satisfies all of mamba-ssm's dependencies. While this pattern can improve determinism, using `--no-deps` bypasses pip's dependency verification, which carries risk if mamba-ssm has undeclared dependencies or if version incompatibilities arise.

Consider adding a comment explaining why this explicit pattern is necessary (e.g., "avoids resolver conflicts with training-hub") or refactor to a single install with both packages declared.
images/universal/training/py312-cuda128-torch280/Dockerfile.debug (1)
232-316: Complex mamba-ssm build wrapper in debug variant: verify necessity and maintainability.

This file's lines 232–316 introduce an 84-line shell script that wraps the mamba-ssm wheel build with heartbeat logging, timeout enforcement (20 min), and process monitoring. This is significantly more elaborate than the simple 3-line install pattern in the main Dockerfile (lines 232–235).
Questions:
- Is the heartbeat/timeout wrapper necessary only for the debug variant, or should it also be in the production Dockerfile?
- Does the script properly handle edge cases (e.g., timeouts during ccache setup)?
- Is the 20-minute hardcoded timeout appropriate for all platforms/architectures?
Recommendation: Add a comment explaining the rationale (e.g., "Debug builds may run longer due to extended diagnostics; timeout prevents hung builds") and consider whether similar protections are needed in the production build.
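The heartbeat-plus-timeout idea the debug variant implements can be reduced to a small portable sketch; the function name, the 1-second poll interval, and the exit code 124 are illustrative choices, not the PR's actual script:

```shell
# Run a command in the background, print a heartbeat while it works,
# and kill it if it exceeds the timeout (mirroring `timeout`'s exit code 124).
run_with_heartbeat() {
  timeout_s=$1; shift
  "$@" &
  pid=$!
  waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout_s" ]; then
      kill "$pid" 2>/dev/null
      echo "TIMEOUT after ${timeout_s}s"
      return 124
    fi
    echo "heartbeat: ${waited}s elapsed, build still running"
    sleep 1
    waited=$((waited + 1))
  done
  wait "$pid"
}

# A fast stand-in for the 20-minute wheel build:
run_with_heartbeat 5 sleep 1 && echo "build finished"
```

In a Dockerfile this would wrap the actual `pip wheel` invocation instead of `sleep`, with the timeout set to the build budget (20 minutes in the debug variant).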
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `images/runtime/.DS_Store` is excluded by `!**/.DS_Store`
- `images/runtime/training/.DS_Store` is excluded by `!**/.DS_Store`
📒 Files selected for processing (5)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (5 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile.debug` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
- images/universal/training/py312-cuda128-torch280/README.md
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
238-238: Verify that `entrypoint-universal.sh` exists in the repository.

Line 238 copies `./images/universal/training/py312-cuda128-torch280/entrypoint-universal.sh` into the image. This file must exist in the repository for the build to succeed.

`images/universal/training/py312-cuda128-torch280/Dockerfile.debug` (2)
319-319: Verify that `entrypoint-universal.sh` exists in the repository.

Line 319 copies `./images/universal/training/py312-cuda128-torch280/entrypoint-universal.sh` into the image. This file must exist in the repository for the build to succeed.
157-157: Review comment is incorrect: bitsandbytes 0.48.1 does provide CUDA 12.8 wheels.

Web search confirms that bitsandbytes 0.48.1 (released Oct 2, 2025) includes binaries targeted for CUDA 12.8/12.9 on PyPI. The review's claim that version 0.48.1 "lacks CUDA 12.8 wheels" is factually incorrect.

Additionally, upgrading to the recommended 0.48.2 would be counterproductive: bitsandbytes 0.48.2 (Oct 29, 2025) targets CUDA 11.7–12.6, which does not include CUDA 12.8 support. For a Dockerfile explicitly targeting CUDA 12.8 (as indicated by the filename `py312-cuda128-torch280`), the current pinned version 0.48.1 is correct.

No changes are needed. The Dockerfile lines 157 and 183 should remain as-is.
Likely an incorrect or invalid review comment.
Force-pushed fa75edd to d4015cd (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (4)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug (2)
68-71: ⚠️ CRITICAL: Missing `ldconfig` call after updating ld.so.conf.d.

Lines 68–71 append NVIDIA library paths to `/etc/ld.so.conf.d/nvidia.conf`, but do not rebuild the dynamic linker cache. Without running `ldconfig`, the CUDA libraries will not be discoverable at the preflight check (line 88) or at runtime. The fail-fast check will fail immediately.

Apply this diff to add ldconfig:

```diff
   && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
   && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+  && ldconfig \
   && dnf clean all \
```
75-78: ⚠️ CRITICAL: Resolve Hadolint DL3044 — ENV self-references variables being defined.

Lines 75–78 violate Docker best practice by referencing `${CUDA_HOME}` within the same `ENV` statement where it is being defined. Additionally, `${LD_LIBRARY_PATH}` on line 77 references itself within the same statement. This prevents Docker from properly expanding these values and is flagged by Hadolint DL3044.

Refactor into separate `ENV` statements:

```diff
- ARG CUDA_ARCH_LIST=9.0
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ARG CUDA_ARCH_LIST=9.0
+ ENV CUDA_HOME=/usr/local/cuda \
+     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```
68-71: ⚠️ CRITICAL: Missing `ldconfig` call after updating ld.so.conf.d.

Lines 68–71 append NVIDIA library paths to `/etc/ld.so.conf.d/nvidia.conf`, but do not rebuild the dynamic linker cache. Without running `ldconfig`, the CUDA libraries will not be discoverable at the preflight check (line 88) or at runtime. The fail-fast check will fail immediately.

Note: This issue was flagged in prior reviews as addressed (commit a8ad8e6) but remains unresolved in the current code.

Apply this diff to add ldconfig:

```diff
   && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
   && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+  && ldconfig \
   && dnf clean all \
```
75-78: ⚠️ CRITICAL: Resolve Hadolint DL3044 — ENV self-references variables being defined.

Lines 75–78 violate Docker best practice by referencing `${CUDA_HOME}` within the same `ENV` statement where it is being defined. Additionally, `${LD_LIBRARY_PATH}` on line 77 references itself within the same statement. This prevents Docker from properly expanding these values and is flagged by Hadolint DL3044.

Note: This issue was flagged in prior reviews as addressed (commit a8ad8e6) but remains unresolved in the current code.

Refactor into separate `ENV` statements:

```diff
- ARG CUDA_ARCH_LIST=9.0
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ARG CUDA_ARCH_LIST=9.0
+ ENV CUDA_HOME=/usr/local/cuda \
+     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `images/runtime/.DS_Store` is excluded by `!**/.DS_Store`
- `images/runtime/training/.DS_Store` is excluded by `!**/.DS_Store`
📒 Files selected for processing (5)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (4 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (4 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (5 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile.debug` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- images/universal/training/py312-cuda128-torch280/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (4)
.tekton/universal-image-py312-cuda128-torch280-push.yaml (1)
20-21: ✓ Tekton pipeline changes look good.

Timeout, build-platforms parameter declaration, and PLATFORM task parameter changes support multi-platform build orchestration consistently. The hardcoded amd64 default aligns with the Dockerfile's amd64 requirement for FlashAttention (visible in the companion Dockerfile at line 121).
Also applies to: 31-33, 108-111, 199-200
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
232-239: ✓ Simplified 2-step installation approach is solid.

The 2-step pattern for causal-conv1d and mamba-ssm (lines 232–239) is cleaner than the Dockerfile.debug equivalent: it installs both dependencies in a straightforward manner without unnecessary `--no-deps` flags or dependency-resolution bypass. This improves auditability and allows pip to verify compatibility.

22-22: All required COPY source files exist; no issues found.

Both `LICENSE.md` and `entrypoint-universal.sh` are present in the repository at their specified paths. The COPY commands on lines 22 and 242 will execute successfully.

`images/universal/training/py312-cuda128-torch280/Dockerfile.debug` (1)
22-22: No issues found: required COPY source files are present.

Both `LICENSE.md` and `entrypoint-universal.sh` exist at their specified paths in the repository. The Docker build will not fail due to missing COPY source files.
```dockerfile
# Fail-fast: ensure binary wheels exist for packages that are expensive to build
RUN mkdir -p /tmp/wheels && \
    python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
    numba==0.62.1 bitsandbytes==0.48.1 || \
```
Lines 157 and 183 pin `bitsandbytes==0.48.1`, which does not provide CUDA 12.8 (cu128) wheel variants on PyPI. The fail-fast check at lines 155–158 will fail, and at runtime the library will fail to load `libbitsandbytes_cuda128`. Web search confirms that bitsandbytes 0.48.2+ provides cu128 support.

Note: This issue was flagged in prior reviews as addressed (commits 02c9a56, ee7d064) but remains unresolved in the current code.

Upgrade to `bitsandbytes==0.48.2` or later in both locations:

```diff
- RUN mkdir -p /tmp/wheels && \
-     python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
-     numba==0.62.1 bitsandbytes==0.48.1 || \
+ RUN mkdir -p /tmp/wheels && \
+     python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
+     numba==0.62.1 bitsandbytes==0.48.2 || \
```

And:

```diff
      tensorboard==2.19.0 \
-     bitsandbytes==0.48.1 \
+     bitsandbytes==0.48.2 \
      liger-kernel==0.6.2 \
```
liger-kernel==0.6.2 \Also applies to: 183-183
🤖 Prompt for AI Agents
images/universal/training/py312-cuda128-torch280/Dockerfile around lines 157 and
183: bitsandbytes is pinned to 0.48.1 which lacks CUDA 12.8 wheels; update both
occurrences to bitsandbytes==0.48.2 (or a later compatible patch) so the
Dockerfile and any fail-fast checks reference a release that provides cu128
wheels, and ensure both line 157 and line 183 are changed to the new version.
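The fail-fast pattern under discussion can also surface a clearer error when wheels are missing; a sketch using the review's suggested pins (the error message text is illustrative):

```dockerfile
# Abort the build early if binary wheels are unavailable for the pinned
# versions, instead of silently falling back to a multi-hour source build.
RUN mkdir -p /tmp/wheels && \
    ( python -m pip download --retries 5 --timeout 120 \
        --only-binary=:all: --no-deps -d /tmp/wheels \
        numba==0.62.1 bitsandbytes==0.48.2 \
      || { echo "ERROR: no binary wheels for pinned versions" >&2; exit 1; } )
```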
Actionable comments posted: 0
♻️ Duplicate comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile (3)
69-72: Split ENV statement to avoid self-referencing variables (Hadolint DL3044).

This issue was flagged in prior reviews but remains unresolved. The single `ENV` statement self-references `${PATH}` and `${LD_LIBRARY_PATH}`, which violates Dockerfile best practices. Split into separate statements:

```diff
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/cuda/bin:$PATH \
-     LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ ENV CUDA_HOME=/usr/local/cuda
+ ENV PATH=/usr/local/cuda/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}
+ ENV TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
```
140-140: Upgrade bitsandbytes to 0.48.2 or later (critical CUDA 12.8 support).

bitsandbytes 0.48.1 lacks official CUDA 12.8 wheels on PyPI and will fail at runtime. This critical issue was flagged in multiple prior reviews and commits. Upgrade to 0.48.2 or later, which provides confirmed cu128 support.

```diff
-     bitsandbytes==0.48.1 \
+     bitsandbytes==0.48.2 \
```

Also update line 155 (pip download step) if present.
167-170: Remove `--no-deps` flag and install both packages in a single command.

The `--no-deps` flag on line 169 bypasses pip's dependency resolution, preventing verification that causal-conv1d and mamba-ssm are compatible. This pattern was flagged in prior reviews. Install both together without `--no-deps` to enable proper validation:

```diff
- # Deterministic 2-step: sub-dep first, then parent without deps
- RUN pip install --no-build-isolation --no-cache-dir causal-conv1d==1.5.3.post1 && \
-     pip install --no-build-isolation --no-cache-dir mamba-ssm==2.2.6.post3 --no-deps && \
+ # Install mamba-ssm with its optional causal-conv1d dependency
+ RUN pip install --no-build-isolation --no-cache-dir causal-conv1d==1.5.3.post1 mamba-ssm==2.2.6.post3 && \
      fix-permissions /opt/app-root -P
```
🧹 Nitpick comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
22-23: Resolve the LICENSE.md TODO before release.

The commented-out COPY instruction should either be re-enabled with the file added to the repo, or removed entirely if licensing is handled elsewhere. Clarify the intent and complete the task.
Do you want me to help you either generate a suitable LICENSE.md file or remove this TODO entirely?
161-166: Remove duplicate TORCH_CUDA_ARCH_LIST and pin versions for build helpers.

Line 165 duplicates the TORCH_CUDA_ARCH_LIST already defined on line 72. Additionally, line 162 uses `pip install -U`, which unpins versions and can cause drift; provide explicit versions for reproducibility.

```diff
  # Build helpers to compile PyTorch extensions
- RUN pip install -U pip setuptools wheel ninja cmake
+ RUN pip install --no-cache-dir pip setuptools wheel ninja cmake
- # Optional: set GPU archs if you hit arch issues
- ENV TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (3 hunks)
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 120-120: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Red Hat Konflux / universal-image-py312-cuda128-torch280-on-pull-request
🔇 Additional comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
112-112: Clarify the SDK branch reference.

Line 112 references the `add-training-hub` branch instead of a release branch. Confirm whether this is intentional for development/testing or if it should reference a stable branch (e.g., `main` or a version tag). The past review comments indicated this was temporary.
173-178: File exists and permissions are correctly configured.

The verification confirms that `entrypoint-universal.sh` exists in the repository at the expected location and will receive the correct executable permissions (0755) via the `--chmod` flag during the Docker build. The build will not fail due to a missing file.
Signed-off-by: Brian Gallagher <[email protected]>
Force-pushed 5598440 to 27d20b4 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug (1)
75-78: Resolve Hadolint DL3044: Split self-referencing ENV statement.

Lines 75–78 violate Docker best practices (Hadolint DL3044) by referencing `${CUDA_HOME}`, `${PATH}`, and `${LD_LIBRARY_PATH}` within the same `ENV` statement where they are being defined. This prevents proper variable expansion and is unreliable across Docker versions.

Refactor into separate `ENV` statements:

```diff
- ARG CUDA_ARCH_LIST=9.0
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ARG CUDA_ARCH_LIST=9.0
+ ENV CUDA_HOME=/usr/local/cuda \
+     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug (1)
234-316: Consider extracting complex wheel-build logic into a separate shell script for maintainability.

Lines 234–316 contain a highly complex multi-line `RUN` statement that uses `printf` to generate an inline shell script with ccache logic, timeouts, heartbeat monitoring, and error handling. While the defensive timeout logic and diagnostics are valuable, embedding this complexity makes the Dockerfile harder to audit and maintain.

Suggested improvement (for a follow-up refactor): Extract the script generation and execution into a separate build script file (e.g., `build-mamba-wheel.sh`), then invoke it from the Dockerfile. This would:
- Separate build logic from container configuration
- Improve readability and testability
- Enable syntax highlighting and linting on the script separately
- Simplify the Dockerfile layer
For now, the current approach is functional. No blocking issue, but worth considering for future maintenance.
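The heartbeat-plus-timeout pattern described above can be sketched as a small reusable function; this is a hedged illustration with toy intervals, not the Dockerfile's actual `printf`-generated script. It runs a long command under GNU `timeout` while a background loop emits periodic progress lines so CI log watchdogs do not kill a "silent" build:

```shell
# Run "$@" under a wall-clock limit, emitting heartbeat lines meanwhile.
# $1 = heartbeat interval (s), $2 = timeout limit (s), rest = command.
run_with_heartbeat() {
  interval="$1"; limit="$2"; shift 2
  ( while :; do echo "heartbeat: still building..."; sleep "$interval"; done ) &
  hb=$!
  timeout --kill-after=10 "$limit" "$@"
  rc=$?
  kill "$hb" 2>/dev/null
  wait "$hb" 2>/dev/null
  return "$rc"
}

# Toy invocation: a 1-second "build" under a generous limit succeeds.
run_with_heartbeat 5 30 sleep 1 && echo "build-ok"
```

Extracted into `build-mamba-wheel.sh`, a helper like this can be linted with shellcheck and unit-tested outside the image build.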
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `images/runtime/.DS_Store` is excluded by `!**/.DS_Store`
- `images/runtime/training/.DS_Store` is excluded by `!**/.DS_Store`
📒 Files selected for processing (5)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (3 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile.debug` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
- images/universal/training/py312-cuda128-torch280/README.md
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 120-120: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (6)
.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml (1)
21-22: Add pipeline timeout for long-running builds.

The 9-hour timeout is appropriate for CUDA toolchain compilation and FlashAttention builds. This aligns with the PR objective and matches similar timeout settings across related Tekton pipelines.
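For reference, a pipeline-level ceiling of this shape lives under `spec.timeouts` in a Tekton v1 PipelineRun; a minimal hedged sketch (the metadata name and task split are illustrative — only the 9h pipeline value comes from this PR):

```yaml
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: universal-image-build   # illustrative name
spec:
  timeouts:
    pipeline: "9h"      # overall ceiling for the whole run
    tasks: "8h30m"      # optional: leave headroom for finally tasks
```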
images/universal/training/py312-cuda128-torch280/Dockerfile (4)
112-112: Verify the SDK git branch is finalized before merge.

Line 112 references a feature branch: `git+https://github.com/briangallagher/sdk@add-training-hub`. Ensure this branch name reflects the final intended state; temporary or feature-specific branches should be resolved to a stable tag or the main branch before release.

Also applies to: 112-112
173-173: Entrypoint wrapper wiring is correct and aligns with the debug variant.

The addition of a POSIX entrypoint wrapper at line 173 enables both workbench and headless runtime modes. This mirrors the approach in `Dockerfile.debug` and is well-motivated by the PR objective to support Training Hub 0.3.0.

Also applies to: 173-173
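The dual-mode behavior described above can be sketched as a tiny POSIX shell dispatcher; this is a hypothetical illustration, not the repository's actual `entrypoint-universal.sh`. The echoed strings stand in for launching the notebook server and for `exec "$@"`:

```shell
# With no arguments, fall through to the workbench default;
# with arguments, exec them directly for headless training jobs.
entrypoint() {
  if [ "$#" -eq 0 ]; then
    echo "start-workbench"      # stand-in for the workbench launcher
  else
    echo "exec: $*"             # stand-in for `exec "$@"`
  fi
}

entrypoint                       # container run with no command
entrypoint python train.py       # container run with a training command
```

This shape preserves the image's existing default CMD while letting Kubernetes Jobs override it with a command.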
140-140: The review comment is factually incorrect and should be disregarded.

bitsandbytes 0.48.1 was released (Oct 2, 2025) and the published PyPI wheels are built to be compatible with CUDA 12.8 (cu128). The Dockerfile correctly pins `bitsandbytes==0.48.1`, which already provides the necessary cu128 support. No upgrade is required.

The original assertion that bitsandbytes 0.48.1 "lacks official CUDA 12.8 (cu128) wheel variants on PyPI" is inaccurate. The code is correct as-is.
Likely an incorrect or invalid review comment.
168-170: The review comment incorrectly characterizes this pattern as risky.

The current code is functionally safe. mamba-ssm 2.2.6.post3 does not hard-require causal-conv1d but exposes it as an optional extra, with a recommended constraint of `causal-conv1d>=1.4.0`. Since `causal-conv1d==1.5.3.post1` is installed on line 168 before mamba-ssm, the optional dependency is already satisfied. The `--no-deps` flag does skip dependency resolution for mamba-ssm, but this does not create a broken state because:
- causal-conv1d is pre-installed and compatible
- mamba-ssm will function correctly with or without causal-conv1d (it's optional, not required)
- No validation error occurs
The suggested refactoring (combining both into a single install command) is valid and arguably cleaner, but it addresses a stylistic preference, not a critical issue. The current code works as intended.
Likely an incorrect or invalid review comment.
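A self-documenting variant of that two-step install can be sketched directly in Dockerfile form; the version pins mirror the review thread, while the comments are illustrative rather than the author's:

```dockerfile
# Step 1: install the optional-but-recommended extension first, so it is
# already present when mamba-ssm builds its CUDA kernels.
RUN pip install --no-build-isolation --no-cache-dir causal-conv1d==1.5.3.post1

# Step 2: install mamba-ssm without dependency resolution; its optional
# extra (causal-conv1d) is satisfied by step 1, and --no-deps keeps pip
# from re-resolving dependencies and conflicting with training-hub pins.
RUN pip install --no-build-isolation --no-cache-dir mamba-ssm==2.2.6.post3 --no-deps
```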
images/universal/training/py312-cuda128-torch280/Dockerfile.debug (1)
319-319: Verify the actual build context and COPY path behavior before applying this change.

The review comment's reasoning contains a critical error. If the build context is the repository root (as claimed), then `COPY entrypoint-universal.sh` in the main Dockerfile would fail: the file is not at the repo root but at `images/universal/training/py312-cuda128-torch280/entrypoint-universal.sh`. The full path used in the debug Dockerfile would actually be the one that works from that context.

Additionally, Dockerfile.debug is not referenced in any Tekton pipeline configuration; only the main Dockerfile is built. Before assuming the debug variant should match the main variant, confirm:
- What is the actual build context used when buildah invokes the Dockerfile?
- Does the main Dockerfile's relative path actually work in practice?
- Is Dockerfile.debug intended for CI/CD builds or manual/development use only?
If both need to coexist, ensure the paths work correctly for their actual build contexts rather than assuming consistency without verification.
```dockerfile
# Fail-fast: ensure binary wheels exist for packages that are expensive to build
RUN mkdir -p /tmp/wheels && \
    python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
        numba==0.62.1 bitsandbytes==0.48.1 || \
```
Line 157 attempts to download bitsandbytes==0.48.1, which does not provide CUDA 12.8 (cu128) wheel variants on PyPI. The fail-fast check at lines 155–158 will fail, preventing the image build from proceeding. Additionally, line 183 pins the same version for the main install, so both must be updated.
Upgrade both occurrences to bitsandbytes==0.48.2 or later (confirmed cu128 support):
```diff
-RUN mkdir -p /tmp/wheels && \
-    python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
-        numba==0.62.1 bitsandbytes==0.48.1 || \
+RUN mkdir -p /tmp/wheels && \
+    python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
+        numba==0.62.1 bitsandbytes==0.48.2 || \
```

And at line 183:

```diff
 tensorboard==2.19.0 \
-bitsandbytes==0.48.1 \
+bitsandbytes==0.48.2 \
 liger-kernel==0.6.2 \
```

Also applies to: 157-157
🤖 Prompt for AI Agents
In images/universal/training/py312-cuda128-torch280/Dockerfile.debug around line
157 (and also update the matching pin at line 183), the Dockerfile currently
installs bitsandbytes==0.48.1 which lacks CUDA 12.8 (cu128) wheels and causes
the fail-fast check to fail; change both pins to bitsandbytes==0.48.2 (or later
confirmed to include cu128 wheels) so the fail-fast check and main install
succeed, keeping the version consistent in both locations.
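The fail-fast idiom this comment depends on is worth spelling out; the sketch below is a hedged stand-in where `fetch_wheels` substitutes for `python -m pip download --only-binary=:all: ...`, so it runs without network access. The point is that a failed binary-wheel fetch should surface a clear error instead of letting the build drift into a multi-hour source compile:

```shell
# Simulate "no binary wheel published for this pin" with a stub.
fetch_wheels() {
  return 1
}

if fetch_wheels; then
  result="wheels-ok"
else
  result="fail-fast"
  echo "ERROR: no binary wheel available; aborting before a slow source build" >&2
fi
echo "$result"   # → fail-fast
```

In the Dockerfile, the `|| ( echo ...; exit 1 )` tail of the `RUN` step plays the role of the `else` branch here.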
27d20b4 to c5ae569 (Compare)
Signed-off-by: Brian Gallagher <[email protected]>
c5ae569 to 2f92670 (Compare)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
69-72: Hadolint DL3044 claim verified, but the proposed diff needs correction.

Docker will not expand a variable within the same ENV statement where it is defined, which is exactly what Hadolint DL3044 flags. The review comment correctly identifies this issue: the original code attempts to reference `$PATH` and `$LD_LIBRARY_PATH` within the same ENV statement where they're being modified.

However, the proposed diff contains a problem: it adds `${CUDA_HOME}/bin` to the PATH, which wasn't in the original code. The original only references `/usr/local/cuda/bin:$PATH`, while the diff proposes `/usr/local/cuda/bin:${CUDA_HOME}/bin:${PATH}` (duplicating the CUDA path).

Corrected approach: split ENV into separate statements to fix DL3044, without changing the path values.

```diff
-ENV CUDA_HOME=/usr/local/cuda \
-    PATH=/usr/local/cuda/bin:$PATH \
-    LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
-    TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ENV CUDA_HOME=/usr/local/cuda \
+    TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ENV PATH=/usr/local/cuda/bin:${PATH}
+ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
```

This corrects the DL3044 violation while preserving the original intended behavior: appending to the existing PATH and LD_LIBRARY_PATH across separate ENV statements, which properly enables variable expansion.
120-158: Upgrade bitsandbytes to 0.48.2 for CUDA 12.8 support.

bitsandbytes 0.48.1 does not publish cu128-tagged wheels on PyPI. Line 140 pins 0.48.1, which will fail at runtime in this CUDA 12.8 environment. bitsandbytes v0.48.2 provides prebuilt wheels supporting CUDA 12.8.

```diff
 tensorboard==2.19.0 \
-bitsandbytes==0.48.1 \
+bitsandbytes==0.48.2 \
 liger-kernel==0.6.2 \
```
♻️ Duplicate comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug (3)
68-71: Add an `ldconfig` call to rebuild the dynamic linker cache after updating `/etc/ld.so.conf.d/nvidia.conf`.

Lines 68–71 append NVIDIA library paths to the ldconfig configuration file, but the dynamic linker cache is not rebuilt. Without running `ldconfig`, the CUDA libraries at those paths will not be discoverable at line 88 (ldconfig grep check) or at runtime.

Apply this diff to rebuild the cache:

```diff
     && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
     && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+    && ldconfig \
     && dnf clean all \
```
155-158: Critical: bitsandbytes 0.48.1 fails the fail-fast check due to missing CUDA 12.8 wheels.

Line 157 pins `bitsandbytes==0.48.1`, which does not provide cu128-tagged wheel variants on PyPI. The fail-fast check will exit with status 1 during the build, preventing image completion. This issue was flagged in prior reviews but remains unresolved in the debug variant.

Upgrade to `bitsandbytes==0.48.2` or later:

```diff
-RUN mkdir -p /tmp/wheels && \
-    python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
-        numba==0.62.1 bitsandbytes==0.48.1 || \
+RUN mkdir -p /tmp/wheels && \
+    python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
+        numba==0.62.1 bitsandbytes==0.48.2 || \
```

Also update line 183 to match:

```diff
 tensorboard==2.19.0 \
-bitsandbytes==0.48.1 \
+bitsandbytes==0.48.2 \
 liger-kernel==0.6.2 \
```
75-78: Hadolint DL3044 violation: the ENV statement self-references the variables being defined.

Lines 76–77 reference `${CUDA_HOME}`, `${PATH}`, and `${LD_LIBRARY_PATH}` within the same `ENV` statement where they are being set, violating Docker best practice. Split into separate `ENV` statements to allow proper variable expansion.

Apply this diff:

```diff
-ARG CUDA_ARCH_LIST=9.0
-ENV CUDA_HOME=/usr/local/cuda \
-    PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-    LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-    TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ARG CUDA_ARCH_LIST=9.0
+ENV CUDA_HOME=/usr/local/cuda \
+    TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug (1)
232-316: Deterministic 2-step wheel build with comprehensive logging and timeout is a strong pattern.

The build approach (lines 234–316) combines ccache caching, parallel compilation, detailed logging, heartbeat monitoring, and a 22-minute timeout with a 120-second kill grace period. This significantly improves debuggability and prevents indefinite hangs. The inline bash script is verbose but appropriate for a debug variant.

As a future nice-to-have, consider extracting the long bash script into a separate shell file (e.g., `build_mamba_wheel.sh`) within the repository and COPY it into the image to improve readability and testability. This is not urgent for the current PR but would simplify future maintenance.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (3 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile.debug` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- images/universal/training/py312-cuda128-torch280/README.md
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 120-120: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Red Hat Konflux / universal-image-py312-cuda128-torch280-on-pull-request
🔇 Additional comments (7)
.tekton/universal-image-py312-cuda128-torch280-push.yaml (1)
20-21: Reasonable timeout for long-running image builds.

The 9h pipeline timeout accommodates compilation steps (causal-conv1d, mamba-ssm) while the build-container task itself is capped at 8h. This prevents runaway builds while allowing sufficient time for the full pipeline to complete.

images/universal/training/py312-cuda128-torch280/Dockerfile (4)
22-23: License file handling addressed via TODO comment.

The LICENSE.md copy is now commented out with a TODO note. This resolves the build-time failure from previous reviews where the file was missing. Ensure the TODO is tracked and the actual license file is added before production release.
112-115: Package and build tooling updates look reasonable.

Line 112 switched to the `sdk@add-training-hub` branch and line 115 commented out the `pip install ninja` step (relying on the later install at line 162). These changes support the Training Hub 0.3.0 upgrade goal.
167-170: Deterministic 2-step build for mamba-ssm extensions.

The two-step installation (causal-conv1d first, then mamba-ssm with `--no-deps`) aims to avoid pip resolver conflicts. However, this pattern bypasses dependency resolution, which is risky if mamba-ssm has undeclared optional dependencies. Ensure via testing that mamba-ssm functions correctly with only causal-conv1d installed and no other implicit dependencies.

To improve defensibility, add a brief comment explaining why this 2-step pattern is necessary:

```diff
+# 2-step deterministic build: install causal-conv1d as a pre-requisite,
+# then mamba-ssm with --no-deps to avoid pip resolver conflicts with training-hub
 RUN pip install --no-build-isolation --no-cache-dir causal-conv1d==1.5.3.post1 && \
     pip install --no-build-isolation --no-cache-dir mamba-ssm==2.2.6.post3 --no-deps && \
     fix-permissions /opt/app-root -P
```
173-178: File exists; review comment concern is resolved.

The verification confirms that `./images/universal/training/py312-cuda128-torch280/entrypoint-universal.sh` exists in the repository. The COPY command on line 173 will succeed at build time. The entrypoint wrapper setup is sound and no issues remain.

images/universal/training/py312-cuda128-torch280/Dockerfile.debug (2)
318-324: Entrypoint wrapper setup aligns with the runtime variant and is consistent.

The COPY (line 319), ENTRYPOINT (line 323), and CMD (line 324) setup mirrors the runtime Dockerfile, preserving workbench behavior by default while enabling headless mode when a command is provided. Verify that `entrypoint-universal.sh` exists in the repository (same concern as the runtime variant).
22-22: No issues found. The LICENSE.md file exists in the repository, so the COPY command on line 22 will execute successfully. The code is correct as-is.
Description
bump universal image to support training hub 0.3.0
Summary by CodeRabbit
New Features
Chores
Documentation